Introduction

This Rmarkdown script (together with the corresponding TAB-separated CSV input data files InfoRateData.csv and AutomaticSylDetect.csv, and the resulting HTML document) contains the full analysis and plotting code accompanying the paper Different languages, similar encoding efficiency: comparable information rates across the human communicative niche.

The data

For more information on the data, please see Oh (2015). There are in total 17 languages (see the Table below).

The Oral Corpus

The oral corpus is based on a subset of the Multext (Multilingual Text Tools and Corpora) parallel corpus (Campione & Véronis, 1998) in British English, German, and Italian. The material consists of 15 short texts of 3-5 semantically connected sentences carefully translated by a native speaker in each language.

For the other 14 languages, two of the authors supervised the translation and recording of new datasets. All participants were native speakers of the target language, with a focus on a specific variety of the language when possible – e.g. Mandarin spoken in Beijing, Serbian in Belgrade and Korean in Seoul. No strict control on age or on the speakers’ social diversity was performed, but speakers were mainly students or members of academic institutions. Speakers were asked to read each text three times (first silently and then aloud twice). The texts were presented one by one on the screen in random order, in a self-paced reading paradigm. This way, speakers familiarized themselves with the text and reduced their reading errors. The second read-aloud recording was analyzed in this study.

With these, the oral corpus contains 2288 recordings of the 15 texts, read by 170 speakers of 17 languages.

For each language and text we have the following number of syllables (NS):

Number of syllables (NS) per language and text.
Language Text NS
CAT O1 88
CAT O2 118
CAT O3 139
CAT O4 142
CAT O6 99
CAT O8 131
CAT O9 81
CAT P0 122
CAT P1 127
CAT P2 118
CAT P3 131
CAT P8 117
CAT P9 119
CAT Q0 127
CAT Q1 93
CMN O1 60
CMN O2 73
CMN O3 73
CMN O4 80
CMN O6 64
CMN O8 69
CMN O9 57
CMN P0 100
CMN P1 70
CMN P2 65
CMN P3 77
CMN P8 78
CMN P9 76
CMN Q0 68
CMN Q1 57
DEU O1 86
DEU O4 87
DEU O6 82
DEU O9 60
DEU P0 111
DEU Q0 113
DEU O2 117
DEU O3 86
DEU O8 115
DEU P1 105
DEU P2 90
DEU P3 106
DEU P8 87
DEU P9 92
DEU Q1 94
ENG O1 70
ENG O2 86
ENG O3 85
ENG O4 84
ENG O6 67
ENG O8 84
ENG O9 67
ENG P1 105
ENG P2 76
ENG P3 91
ENG Q0 88
ENG Q1 62
ENG P0 92
ENG P8 77
ENG P9 91
EUS O1 102
EUS O2 106
EUS O3 108
EUS O4 135
EUS O6 107
EUS O8 138
EUS O9 67
EUS P0 142
EUS P1 117
EUS P2 109
EUS P3 119
EUS P8 121
EUS P9 106
EUS Q0 121
EUS Q1 87
FIN O1 91
FIN O2 110
FIN O3 110
FIN O4 123
FIN O6 84
FIN O8 111
FIN O9 74
FIN P0 117
FIN P1 123
FIN P2 98
FIN P3 119
FIN P8 96
FIN P9 108
FIN Q0 113
FIN Q1 83
FRA O1 87
FRA O2 106
FRA O3 95
FRA O4 93
FRA O6 77
FRA O8 94
FRA O9 65
FRA P0 100
FRA P1 104
FRA P2 88
FRA P3 107
FRA P8 95
FRA P9 92
FRA Q0 99
FRA Q1 68
HUN O1 89
HUN O2 112
HUN O3 100
HUN O4 124
HUN O6 72
HUN O8 99
HUN O9 82
HUN P0 122
HUN P1 112
HUN P2 105
HUN P3 113
HUN P8 101
HUN P9 117
HUN Q0 103
HUN Q1 98
ITA O4 83
ITA O6 86
ITA O8 109
ITA O9 68
ITA P0 123
ITA P8 110
ITA P9 100
ITA Q1 106
ITA O1 89
ITA O2 113
ITA O3 100
ITA P1 109
ITA P2 117
ITA P3 111
ITA Q0 110
JPN O1 119
JPN O2 162
JPN O3 153
JPN O4 154
JPN O6 129
JPN O8 159
JPN O9 83
JPN P0 156
JPN P1 131
JPN P2 142
JPN P3 152
JPN P8 117
JPN P9 139
JPN Q0 150
JPN Q1 126
KOR O1 86
KOR O2 105
KOR O3 116
KOR O4 136
KOR O6 107
KOR O8 132
KOR O9 86
KOR P0 133
KOR P1 124
KOR P2 128
KOR P3 117
KOR P8 115
KOR P9 127
KOR Q0 112
KOR Q1 115
SPA O1 94
SPA O2 135
SPA O3 111
SPA O4 152
SPA O6 103
SPA O8 136
SPA O9 79
SPA P0 153
SPA P1 137
SPA P2 120
SPA P3 119
SPA P8 93
SPA P9 119
SPA Q0 126
SPA Q1 81
SRP O1 87
SRP O2 99
SRP O3 120
SRP O4 128
SRP O6 98
SRP O8 110
SRP O9 75
SRP P0 137
SRP P1 121
SRP P2 98
SRP P3 129
SRP P8 110
SRP P9 111
SRP Q0 102
SRP Q1 89
THA O1 64
THA O2 74
THA O3 85
THA O4 110
THA O6 78
THA O8 93
THA O9 60
THA P0 103
THA P1 96
THA P2 79
THA P3 95
THA P8 81
THA P9 77
THA Q0 73
THA Q1 56
TUR O1 108
TUR O2 139
TUR O3 120
TUR O4 143
TUR O6 99
TUR O8 130
TUR O9 79
TUR P0 102
TUR P1 102
TUR P2 116
TUR P3 157
TUR P8 94
TUR P9 128
TUR Q0 107
TUR Q1 88
VIE O1 49
VIE O2 93
VIE O3 55
VIE O4 89
VIE O6 80
VIE O8 81
VIE O9 52
VIE P0 102
VIE P1 77
VIE P2 81
VIE P3 90
VIE P8 75
VIE P9 61
VIE Q0 56
VIE Q1 55
YUE O1 59
YUE O2 80
YUE O3 96
YUE O4 69
YUE O6 80
YUE O8 92
YUE O9 56
YUE P0 102
YUE P1 88
YUE P2 78
YUE P3 83
YUE P8 77
YUE P9 90
YUE Q0 88
YUE Q1 73

Text Corpus

Text datasets were acquired from various sources as illustrated in the Table below. After an initial data curation, each dataset was phonetically transcribed and automatically syllabified by a rule-based program written by one of the authors, except in the following cases:

  1. when syllabification was already provided with the dataset (English, French, German, and Vietnamese for the multisyllabic words);
  2. when the corpus was syllabified by an automatic grapheme-to-phoneme converter (Catalan, Spanish, and Thai).

Additionally, no syllabification was required for Sino-Tibetan languages (Cantonese and Mandarin Chinese) since one ideogram corresponds to one syllable.

Language | Family | ISO 639-3 | Corpus
Basque | Basque | EUS | E-Hitz (Perea et al., 2006)
British English | Indo-European | ENG | WebCelex (MPI for Psycholinguistics)
Cantonese | Sino-Tibetan | YUE | A linguistic corpus of mid-20th century Hong Kong Cantonese
Catalan | Indo-European | CAT | Frequency dictionary (Zséder et al., 2012)
Finnish | Uralic | FIN | Finnish Parole Corpus
French | Indo-European | FRA | Lexique 3.80 (New et al., 2001)
German | Indo-European | DEU | WebCelex (MPI for Psycholinguistics)
Hungarian | Uralic | HUN | Hungarian National Corpus (Váradi, 2002)
Italian | Indo-European | ITA | The Corpus PAISÀ (Lyding et al., 2014)
Japanese | Japanese | JPN | Japanese Internet Corpus (Sharoff, 2006)
Korean | Korean | KOR | Leipzig Corpora Collection (LCC)
Mandarin Chinese | Sino-Tibetan | CMN | Chinese Internet Corpus (Sharoff, 2006)
Serbian | Indo-European | SRP | Frequency dictionary (Zséder et al., 2012)
Spanish | Indo-European | SPA | Frequency dictionary (Zséder et al., 2012)
Thai | Tai-Kadai | THA | Thai National Corpus (TNC)
Turkish | Turkic | TUR | Leipzig Corpora Collection (LCC)
Vietnamese | Austroasiatic | VIE | VNSpeechCorpus (Le et al., 2004)

Dataset structure

The data is structured as follows:

  • Language is the unique language ISO 639-3 ID with 17 possible values “CAT”, “CMN”, “DEU”, “ENG”, “EUS”, “FIN”, “FRA”, “HUN”, “ITA”, “JPN”, “KOR”, “SPA”, “SRP”, “THA”, “TUR”, “VIE”, “YUE”;
  • Language family is the language family with 9 possible values “Austroasiatic”, “Basque”, “Indo-European”, “Japanese”, “Korean”, “Sino-Tibetan”, “Tai-Kadai”, “Turkic”, “Uralic”;
  • Text is the text identifier (similar across languages and speakers) with 15 possible values “O1”, “O2”, “O3”, “O4”, “O6”, “O8”, “O9”, “P0”, “P1”, “P2”, “P3”, “P8”, “P9”, “Q0”, “Q1”;
  • Speaker is the unique speaker’s ID (there are 170 in the dataset);
  • Sex is the speaker’s sex;
  • Age is the speaker’s age (available only for a subset of 132 speakers);
  • Duration is the text’s duration in seconds as spoken by the given speaker (pauses longer than 150 ms being excluded);
  • NS is the text’s number of syllables (the same for a given text in a given language);
  • SR is the speech rate (syllables/second) for a given text spoken by a given speaker;
  • ShE (resp. ID) is the first-order (resp. second-order) language-level entropy estimate;
  • ShIR and IR are text * speaker properties obtained by multiplying SR by ShE and ID respectively.
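
The derived columns can be recomputed from the primary ones. The following is a minimal sketch in Python, assuming that SR is obtained as NS / Duration (this is implied by its unit, syllables/second, but is not stated explicitly above); the function name and the example values are hypothetical.

```python
def derived_rates(ns, duration, she, info_density):
    """Return (SR, ShIR, IR) for one text read by one speaker.

    Assumption: SR = NS / Duration (syllables per second).
    ShIR = SR * ShE and IR = SR * ID, as described in the list above.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    sr = ns / duration
    return sr, sr * she, sr * info_density


# Hypothetical record: 100 syllables in 16 s, with ShE = 9.0 and ID = 6.0.
sr, shir, ir = derived_rates(ns=100, duration=16.0, she=9.0, info_density=6.0)
print(f"SR={sr:.3f} ShIR={shir:.3f} IR={ir:.3f}")
```

Because ShE and ID are language-level constants, ShIR and IR vary within a language only through SR.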

Technical notes on regressions

We use sum contrasts throughout for the factor IVs. These are sum-to-zero contrasts that compare every level of the IV to the overall mean (for example, for a two-level factor such as Sex we do not compare Males with Females, but each of them with their overall mean, which is included in the intercept). However, the level names produced by the R function contr.sum(), which is used to define these contrasts, are very uninformative, so we spell them out below (please note that in the model outputs the last level is usually not shown):

  • Sex: Sex1 = F, Sex2 = M (the last level, Sex2 is usually not displayed);
  • Text: Text1 = O1, Text2 = O2, Text3 = O3, Text4 = O4, Text5 = O6, Text6 = O8, Text7 = O9, Text8 = P0, Text9 = P1, Text10 = P2, Text11 = P3, Text12 = P8, Text13 = P9, Text14 = Q0, Text15 = Q1 (the last level, Text15 is usually not displayed);
  • Language: Language1 = CAT, Language2 = CMN, Language3 = DEU, Language4 = ENG, Language5 = EUS, Language6 = FIN, Language7 = FRA, Language8 = HUN, Language9 = ITA, Language10 = JPN, Language11 = KOR, Language12 = SPA, Language13 = SRP, Language14 = THA, Language15 = TUR, Language16 = VIE, Language17 = YUE (the last level, Language17 is usually not displayed);
  • Family: Family1 = Austroasiatic, Family2 = Basque, Family3 = Indo-European, Family4 = Japanese, Family5 = Korean, Family6 = Sino-Tibetan, Family7 = Tai-Kadai, Family8 = Turkic, Family9 = Uralic (the last level, Family9 is usually not displayed);
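
To make the coding concrete, here is a small Python sketch that builds the same kind of matrix that contr.sum() produces in R. The helper name sum_contrasts is hypothetical and the code is only an illustration of the coding scheme, not part of the analysis script.

```python
def sum_contrasts(levels):
    """Map each level to its sum-contrast row (k levels -> k-1 columns).

    Layout: level i (for i < k) gets a 1 in column i and 0 elsewhere;
    the last level gets -1 in every column, so each column sums to zero.
    """
    k = len(levels)
    if k < 2:
        raise ValueError("need at least two levels")
    rows = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            rows[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            rows[level] = [-1] * (k - 1)
    return rows


# Sex has a single contrast column (Sex1): F is coded 1 and M is coded -1.
print(sum_contrasts(["F", "M"]))
```

With this coding, the effect of the last level equals minus the sum of the displayed coefficients, which is why the model outputs can omit it.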

Exploratory plots and summaries

Speaker characteristics

Distribution of speaker number and characteristics (sex and age) by language. Lng=language, # spkrs=number of speakers, % fem=percent female speakers, # age=number of speakers with age info, and for those, mean(age)=mean age, sd(age)=standard deviation of age, and actual ages=the sorted ages.
Lng # spkrs % fem # age mean(age) sd(age) actual ages
CAT 10 50 10 35.4 9.2 (21, 28, 28, 29, 31, 39, 42, 42, 44, 50)
CMN 10 50 9 23.1 4.5 (19, 19, 19, 19, 24, 24, 25, 28, 31)
DEU 10 50 0 NaN NaN ()
ENG 10 50 0 NaN NaN ()
EUS 10 50 10 28.0 4.9 (19, 22, 26, 27, 28, 29, 30, 31, 32, 36)
FIN 10 50 10 33.2 11.0 (16, 22, 26, 28, 30, 35, 37, 41, 45, 52)
FRA 10 50 10 32.5 7.7 (24, 25, 25, 27, 28, 36, 36, 37, 41, 46)
HUN 10 50 10 39.3 15.8 (17, 27, 27, 31, 33, 39, 42, 51, 57, 69)
ITA 10 50 0 NaN NaN ()
JPN 10 50 10 30.6 12.8 (20, 20, 21, 22, 22, 28, 29, 40, 51, 53)
KOR 10 50 10 28.6 10.6 (16, 19, 19, 19, 28, 31, 33, 35, 36, 50)
SPA 10 50 10 33.7 10.1 (21, 22, 26, 28, 30, 32, 42, 42, 44, 50)
SRP 10 50 10 30.6 7.8 (19, 21, 23, 30, 31, 32, 34, 34, 38, 44)
THA 10 50 10 30.1 5.7 (23, 23, 27, 28, 30, 31, 31, 32, 33, 43)
TUR 10 50 7 32.6 7.2 (24, 25, 30, 31, 37, 37, 44)
VIE 10 50 6 27.2 4.1 (21, 25, 26, 28, 31, 32)
YUE 10 50 10 22.0 1.5 (20, 20, 21, 21, 22, 22, 23, 23, 24, 24)
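
As a quick check of how the age columns are derived, the sketch below recomputes mean(age) and sd(age) for Catalan from the listed ages. It assumes that sd(age) is the sample standard deviation (n - 1 denominator).

```python
import statistics

# Ages of the 10 Catalan (CAT) speakers, copied from the table above.
cat_ages = [21, 28, 28, 29, 31, 39, 42, 42, 44, 50]

mean_age = statistics.mean(cat_ages)
sd_age = statistics.stdev(cat_ages)  # sample SD (n - 1 denominator)

print(f"mean(age)={mean_age:.1f} sd(age)={sd_age:.1f}")
```

The result should agree with the CAT row (35.4 and 9.2) after rounding to one decimal.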

NS

Figure: NS: exploratory plots.

mean=101.196, median=100, sd=24.703, CV=0.244, min=49, max=162, kurtosis=2.484, skewness=0.177.
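
The summary lines in this section can be reproduced along the following lines. This Python sketch assumes that sd is the sample standard deviation, that CV = sd / mean, and that skewness and kurtosis are the moment-based versions (so a normal distribution has kurtosis 3, consistent with the magnitudes reported here); the exact R functions used are not shown in this document. The helper name describe is hypothetical.

```python
import statistics


def describe(values):
    """Summary statistics in the style of the summary lines above.

    Assumptions: sample SD, CV = sd / mean, and moment-based skewness
    (m3 / m2^1.5) and kurtosis (m4 / m2^2, equal to 3 for a normal).
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "CV": sd / mean,
        "min": min(values),
        "max": max(values),
        "kurtosis": m4 / m2 ** 2,
        "skewness": m3 / m2 ** 1.5,
    }


# Hypothetical toy data, not taken from the corpus.
print(describe([1, 2, 3, 4, 5]))

# Consistency check on the reported NS figures: CV = sd / mean.
ns_cv = round(24.703 / 101.196, 3)
print(ns_cv)
```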

SR

Figure: SR: exploratory plots.

Figure: SR per speaker.

Figure: SR by Sex and Age across Languages.

Figure: SR by Sex, Age and Language.

Figure: SR by language.

mean=6.631, median=6.777, sd=1.148, CV=0.173, min=3.589, max=9.492, kurtosis=2.408, skewness=-0.168.

ShE and ID

Figure: ShE and ID: exploratory plots.

Figure: ShE vs ID.


    Pearson's product-moment correlation

data:  tmp1$ShE and tmp1$ID
t = 2.0326, df = 15, p-value = 0.06019
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02052914  0.77274779
sample estimates:
      cor 
0.4647009 

    Spearman's rank correlation rho

data:  tmp1$ShE and tmp1$ID
S = 451.88, p-value = 0.07259
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4462208 

    Paired t-test

data:  tmp1$ShE and tmp1$ID
t = 11.635, df = 16, p-value = 3.213e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.158040 3.119607
sample estimates:
mean of the differences 
               2.638824 
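
The t statistic of the Pearson test above follows directly from the correlation coefficient and the number of languages. A minimal Python sketch (the function name pearson_t is hypothetical):

```python
import math


def pearson_t(r, n):
    """t statistic and degrees of freedom for H0: correlation = 0."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r)), df


# ShE vs ID across the 17 languages (r copied from the output above).
t, df = pearson_t(0.4647009, 17)
print(f"t = {t:.4f}, df = {df}")
```

This should reproduce the reported t = 2.0326 with df = 15, up to rounding of the printed r. With only 17 data points, a moderate correlation of about 0.46 does not reach the conventional 0.05 threshold.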

ShE:

mean=8.621, median=8.69, sd=0.904, CV=0.105, min=6.07, max=9.83, kurtosis=4.665, skewness=-1.122.

ID:

mean=6.009, median=5.56, sd=0.883, CV=0.147, min=4.83, max=8.02, kurtosis=2.53, skewness=0.747.

ShIR and IR

Figure: ShIR: exploratory plots.

Figure: ShIR per speaker.

Figure: ShIR by Sex and Age across Languages.

Figure: ShIR by Sex, Age and Language.

Figure: ShIR by language.

Figure: IR: exploratory plots.

Figure: IR per speaker.

Figure: IR by Sex and Age across Languages.

Figure: IR by Sex, Age and Language.

Figure: IR by language.

ShIR:

mean=56.709, median=57.207, sd=9.35, CV=0.165, min=32.772, max=89.235, kurtosis=2.444, skewness=0.079.

IR:

mean=39.153, median=39.13, sd=5.097, CV=0.13, min=25.631, max=60.692, kurtosis=3.622, skewness=0.325.

SR and ID

Figure: SR vs ID.


    Pearson's product-moment correlation

data:  info.rate.data$SR and info.rate.data$ID
t = -45.329, df = 2286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7090031 -0.6658066
sample estimates:
       cor 
-0.6880138 

    Spearman's rank correlation rho

data:  info.rate.data$SR and info.rate.data$ID
S = 3393600000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.6999614 
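
The S statistic printed with Spearman's test is a simple function of rho and the number of observations. The Python sketch below recomputes it for both Spearman outputs in this document (ShE vs ID over 17 languages, and SR vs ID over 2288 observations); the function name is hypothetical.

```python
def spearman_s(rho, n):
    """S statistic implied by Spearman's rho for n paired observations."""
    return (n ** 3 - n) * (1.0 - rho) / 6.0


# rho values copied from the two Spearman outputs in this document.
s_languages = spearman_s(0.4462208, 17)
s_observations = spearman_s(-0.6999614, 2288)

print(round(s_languages, 2))
print(f"{s_observations:.5g}")
```

The values should agree with the reported S = 451.88 and S = 3393600000 up to the precision printed by R.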

Linear Mixed Models (LMMs)

Linear ICCs

ICC for model NS ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.26
Family 0.42
Language 0.19
Speaker 0.00
ICC for model SR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Family 0.50
Language 0.19
Speaker 0.24
ICC for model ShIR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Family 0.47
Language 0.13
Speaker 0.30
ICC for model IR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.02
Family 0.03
Language 0.32
Speaker 0.49
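
This document does not show how the ICCs were computed. Under the usual variance-partition definition, the ICC of a grouping factor is its variance component divided by the total variance (all components plus the residual). The Python sketch below illustrates that definition with hypothetical variance components; it is not the code used to produce the tables above.

```python
def icc(variances):
    """Share of the total variance attributable to each component.

    `variances` maps component names (grouping factors plus "Residual")
    to their estimated variances.
    """
    total = sum(variances.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: v / total for name, v in variances.items()}


# Hypothetical variance components, for illustration only.
components = {"Text": 1.0, "Family": 2.0, "Language": 4.0,
              "Speaker": 0.5, "Residual": 0.5}
for name, share in icc(components).items():
    print(f"{name}: {share:.3f}")
```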

SR

Comparing models

AIC and BIC for various hierarchical models of SR (the minimum values, AIC = 2121.04 and BIC = 2161.19, both belong to the model with Sex and all random effects).
model AIC BIC
1 + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 2127.51 2161.93
1 + (1 | Family/Language) + (1 | Speaker) 2397.97 2426.65
1 + (1 | Text) + (1 | Speaker) 2257.73 2280.68
1 + (1 | Text) + (1 | Family/Language) 4717.06 4745.73
1 + Sex + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 2121.04 2161.19
1 + Sex + (1 | Family/Language) + (1 | Speaker) 2391.29 2425.7
1 + Sex + (1 | Text) + (1 | Speaker) 2258.63 2287.31
1 + Sex + (1 | Text) + (1 | Family/Language) 4584.11 4618.52
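
The model comparison can be read off the table programmatically. The Python sketch below selects the models with minimum AIC and BIC from the SR values above. As a sanity check it also recovers the number of estimated parameters k from the gap BIC - AIC = k (ln n - 2), assuming that n = 2288 observations entered the BIC penalty (an assumption; the document does not state it).

```python
import math

# (AIC, BIC) pairs copied from the SR table above.
sr_models = {
    "1 + (1 | Text) + (1 | Family/Language) + (1 | Speaker)": (2127.51, 2161.93),
    "1 + (1 | Family/Language) + (1 | Speaker)": (2397.97, 2426.65),
    "1 + (1 | Text) + (1 | Speaker)": (2257.73, 2280.68),
    "1 + (1 | Text) + (1 | Family/Language)": (4717.06, 4745.73),
    "1 + Sex + (1 | Text) + (1 | Family/Language) + (1 | Speaker)": (2121.04, 2161.19),
    "1 + Sex + (1 | Family/Language) + (1 | Speaker)": (2391.29, 2425.70),
    "1 + Sex + (1 | Text) + (1 | Speaker)": (2258.63, 2287.31),
    "1 + Sex + (1 | Text) + (1 | Family/Language)": (4584.11, 4618.52),
}

best_aic = min(sr_models, key=lambda m: sr_models[m][0])
best_bic = min(sr_models, key=lambda m: sr_models[m][1])
print("min AIC:", best_aic)
print("min BIC:", best_bic)

# BIC - AIC = k * (ln(n) - 2), so the gap reveals the parameter count k.
n_obs = 2288  # assumed number of observations in the BIC penalty
aic, bic = sr_models[best_aic]
k = (bic - aic) / (math.log(n_obs) - 2)
print("implied number of parameters:", round(k))
```

Dropping the Speaker random effect is by far the most costly change (AIC above 4500), while removing Text or Family/Language has a smaller but still clear cost.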

Residuals and random effects

We consider here the full model SR ~ 1 + Sex + (1|Text) + (1|Family/Language) + (1|Speaker).

IR

Comparing models

AIC and BIC for various hierarchical models of IR (the minimum values, AIC = 10247.34 and BIC = 10287.49, both belong to the model with Sex and all random effects).
model AIC BIC
1 + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 10257.84 10292.25
1 + (1 | Family/Language) + (1 | Speaker) 10524.5 10553.18
1 + (1 | Text) + (1 | Speaker) 10305.76 10328.71
1 + (1 | Text) + (1 | Family/Language) 12888.49 12917.17
1 + Sex + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 10247.34 10287.49
1 + Sex + (1 | Family/Language) + (1 | Speaker) 10513.79 10548.21
1 + Sex + (1 | Text) + (1 | Speaker) 10300.05 10328.72
1 + Sex + (1 | Text) + (1 | Family/Language) 12747.04 12781.45

Residuals and random effects

We consider here the full model IR ~ 1 + Sex + (1|Text) + (1|Family/Language) + (1|Speaker).

Generalized Additive Models for Location, Scale and Shape (GAMLSS)

We will use a Gaussian distribution (with fixed or modelled variance).

SR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  -7.131581e-05 
                       variance   =  1.000437 
               coef. of skewness  =  0.04303609 
               coef. of kurtosis  =  3.710291 
Filliben correlation coefficient  =  0.9979187 
******************************************************************


Deviance= 1177.818 

AIC= 1576.98

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002001851 
                       variance   =  1.000433 
               coef. of skewness  =  0.03014483 
               coef. of kurtosis  =  2.864355 
Filliben correlation coefficient  =  0.9994151 
******************************************************************


Deviance= 815.9172 

AIC= 1405.705 

The residuals are less heteroscedastic than before and the fit to the data is better (AIC = 1405.705, compared with 1576.98 for fixed σ). The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex + random(Text) + random(Language) +      random(Family) + random(Speaker), sigma.formula = ~1 +      Sex + random(Text) + random(Language) + random(Family) +  
    random(Speaker), family = NO(mu.link = "identity"),      data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.629552   0.005816 1139.83   <2e-16 ***
Sex1        -0.168157   0.005816  -28.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.23693    0.01478 -83.673  < 2e-16 ***
Sex1        -0.05788    0.01478  -3.915 9.34e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2288 
Degrees of Freedom for the fit:  294.8937
      Residual Deg. of Freedom:  1993.106 
                      at cycle:  53 
 
Global Deviance:     815.9172 
            AIC:     1405.705 
            SBC:     3097.048 
******************************************************************
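
The AIC and SBC lines of this summary are both instances of the generalised AIC, i.e. the global deviance plus a penalty times the (effective) degrees of freedom of the fit, with penalty 2 for AIC and ln(n) for SBC. A minimal Python sketch that recomputes them from the printed values (the function name gaic is only illustrative here):

```python
import math


def gaic(deviance, df, penalty):
    """Generalised AIC: global deviance plus penalty times effective df."""
    return deviance + penalty * df


# Values copied from the SR model summary above.
deviance, df_fit, n_obs = 815.9172, 294.8937, 2288

aic = gaic(deviance, df_fit, penalty=2.0)
sbc = gaic(deviance, df_fit, penalty=math.log(n_obs))
print(f"AIC = {aic:.3f}, SBC = {sbc:.3f}")
```

Since ln(2288) is about 7.7, SBC penalises the roughly 295 effective degrees of freedom much more heavily than AIC, which explains the large gap between the two criteria.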

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.39185 
Random effect parameter sigma_b: 0.109103 
Smoothing parameter lambda     : 84.5402 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.98344 
Random effect parameter sigma_b: 0.870621 
Smoothing parameter lambda     : 1.32916 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.01237646 
Random effect parameter sigma_b: 1.91981e-05 
Smoothing parameter lambda     : 2713240000 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.4211 
Random effect parameter sigma_b: 0.569042 
Smoothing parameter lambda     : 3.32891

σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.00152528 
Random effect parameter sigma_b: 0.000518285 
Smoothing parameter lambda     : 3475370 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.62296 
Random effect parameter sigma_b: 0.153319 
Smoothing parameter lambda     : 39.9698 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.001713103 
Random effect parameter sigma_b: 0.000184844 
Smoothing parameter lambda     : 27323100 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 79.45882 
Random effect parameter sigma_b: 0.181484 
Smoothing parameter lambda     : 29.3639 

IR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  8.557955e-05 
                       variance   =  1.000437 
               coef. of skewness  =  0.09721211 
               coef. of kurtosis  =  3.684191 
Filliben correlation coefficient  =  0.9978017 
******************************************************************


Deviance= 9322.636 

AIC= 9721.861

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002104815 
                       variance   =  1.000442 
               coef. of skewness  =  0.03139581 
               coef. of kurtosis  =  2.864031 
Filliben correlation coefficient  =  0.9993924 
******************************************************************


Deviance= 8961.553 

AIC= 9554.321 

Again, modelling σ gives a better fit to the data (AIC = 9554.321, compared with 9721.861 for fixed σ). The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex + random(Text) + random(Language) +      random(Family) + random(Speaker), sigma.formula = ~1 +      Sex + random(Text) + random(Language) + random(Family) +  
    random(Speaker), family = NO(mu.link = "identity"),      data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.14451    0.03463 1130.31   <2e-16 ***
Sex1        -1.01064    0.03463  -29.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.54554    0.01478  36.903  < 2e-16 ***
Sex1        -0.05935    0.01478  -4.015 6.17e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2288 
Degrees of Freedom for the fit:  296.3836
      Residual Deg. of Freedom:  1991.616 
                      at cycle:  6 
 
Global Deviance:     8961.553 
            AIC:     9554.321 
            SBC:     11254.21 
******************************************************************
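
To read the fixed-effect coefficients of this summary, recall that Sex uses sum contrasts (Sex1 is +1 for F and -1 for M), that μ has an identity link and that σ has a log link. The Python sketch below converts the coefficients into per-sex values of μ and σ with all random effects set to zero; the function name is hypothetical.

```python
import math


def fitted_by_sex(mu_coefs, sigma_coefs):
    """Per-sex (mu, sigma) from (intercept, Sex1) coefficient pairs.

    Sex1 is coded +1 for F and -1 for M (sum contrasts); mu uses the
    identity link and sigma the log link. Random effects are ignored.
    """
    out = {}
    for sex, code in (("F", 1), ("M", -1)):
        mu = mu_coefs[0] + code * mu_coefs[1]
        sigma = math.exp(sigma_coefs[0] + code * sigma_coefs[1])
        out[sex] = (mu, sigma)
    return out


# Coefficients copied from the IR model summary above.
for sex, (mu, sigma) in fitted_by_sex((39.14451, -1.01064),
                                      (0.54554, -0.05935)).items():
    print(f"{sex}: mu = {mu:.2f}, sigma = {sigma:.2f}")
```

Under this reading, female speakers have both a lower mean IR and a slightly smaller residual standard deviation than male speakers.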

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.41202 
Random effect parameter sigma_b: 0.660353 
Smoothing parameter lambda     : 2.30777 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.95235 
Random effect parameter sigma_b: 3.10086 
Smoothing parameter lambda     : 0.104779 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.01908253 
Random effect parameter sigma_b: 0.000233421 
Smoothing parameter lambda     : 18354000 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.2917 
Random effect parameter sigma_b: 3.39374 
Smoothing parameter lambda     : 0.0935855

σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.0004688834 
Random effect parameter sigma_b: 0.000282878 
Smoothing parameter lambda     : 11664800 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.73159 
Random effect parameter sigma_b: 0.157607 
Smoothing parameter lambda     : 37.8208 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.001453389 
Random effect parameter sigma_b: 0.00011236 
Smoothing parameter lambda     : 73935500 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 80.97495 
Random effect parameter sigma_b: 0.184921 
Smoothing parameter lambda     : 28.2979 

Modelling the relationship between SR and ID

Let’s model SR with ID as an additional predictor (fixed effect) interacting with Sex. N.B. In this case, we must drop Language as a random effect, since each language has, by definition, only one value of ID.

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID * Sex + random(Text) +      random(Speaker) + random(Family), sigma.formula = ~1 +      ID + Sex + random(Text) + random(Speaker) + random(Family),  
    family = NO(mu.link = "identity"), data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 11.970094   0.037525  318.987  < 2e-16 ***
ID          -0.888703   0.005992 -148.324  < 2e-16 ***
Sex1        -0.062504   0.037524   -1.666  0.09593 .  
ID:Sex1     -0.017079   0.005991   -2.851  0.00441 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.73603    0.10289  -7.153 1.18e-12 ***
ID          -0.08460    0.01694  -4.993 6.45e-07 ***
Sex1        -0.05723    0.01478  -3.871 0.000112 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2288 
Degrees of Freedom for the fit:  293.3745
      Residual Deg. of Freedom:  1994.626 
                      at cycle:  8 
 
Global Deviance:     790.2025 
            AIC:     1376.951 
            SBC:     3059.581 
******************************************************************

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001905342 
                       variance   =  1.000433 
               coef. of skewness  =  0.01947475 
               coef. of kurtosis  =  2.77394 
Filliben correlation coefficient  =  0.9993355 
******************************************************************


Deviance= 790.2025 

AIC= 1376.951 

Adding ID as a predictor improves the fit (as judged by AIC). There is a negative estimate for ID, but significance is difficult to assess in GAMLSS models that involve smoothing functions. However, a simple lmer model also shows a significant effect of ID:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ 1 + ID * Sex + (1 | Text) + (1 | Speaker) + (1 | Family)
   Data: info.rate.data

REML criterion at convergence: 2121.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7806 -0.6226  0.0176  0.5808  5.1707 

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.4618   0.6795  
 Text     (Intercept) 0.0172   0.1311  
 Family   (Intercept) 0.2443   0.4942  
 Residual             0.1063   0.3260  
Number of obs: 2288, groups:  Speaker, 170; Text, 15; Family, 9

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  10.521343   0.637420  42.244565  16.506  < 2e-16 ***
ID           -0.658948   0.101261  54.414768  -6.507 2.52e-08 ***
Sex1         -0.128232   0.367323 155.446204  -0.349    0.727    
ID:Sex1      -0.006959   0.060377 155.565771  -0.115    0.908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
        (Intr) ID     Sex1  
ID      -0.959              
Sex1     0.002 -0.002       
ID:Sex1 -0.002  0.002 -0.990

Type III Analysis of Variance Table with Satterthwaite's method
       Sum Sq Mean Sq NumDF   DenDF F value    Pr(>F)    
ID     4.4998  4.4998     1  54.415 42.3465 2.516e-08 ***
Sex    0.0130  0.0130     1 155.446  0.1219    0.7275    
ID:Sex 0.0014  0.0014     1 155.566  0.0133    0.9084    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.41708 
Random effect parameter sigma_b: 0.11065 
Smoothing parameter lambda     : 82.1944 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 167.4872 
Random effect parameter sigma_b: 0.7577 
Smoothing parameter lambda     : 1.8794 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.01763365 
Random effect parameter sigma_b: 0.000168856 
Smoothing parameter lambda     : 35072900 

σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.03217714 
Random effect parameter sigma_b: 0.00241144 
Smoothing parameter lambda     : 152754 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 104.4197 
Random effect parameter sigma_b: 0.242522 
Smoothing parameter lambda     : 15.8243 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.000711841 
Random effect parameter sigma_b: 0.000279392 
Smoothing parameter lambda     : 11379200 

Do Age and Sex matter?

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ Age * Sex + (1 | Text) + (1 | Language)
   Data: info.rate.data

REML criterion at convergence: 3743.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4947 -0.6305  0.0105  0.5978  3.6970 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.01299  0.1140  
 Language (Intercept) 1.01328  1.0066  
 Residual             0.36283  0.6024  
Number of obs: 1979, groups:  Text, 15; Language, 14

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  6.809e+00  2.753e-01  1.417e+01  24.734 4.61e-13 ***
Age         -5.972e-03  1.592e-03  1.951e+03  -3.751 0.000181 ***
Sex1        -1.084e-01  4.608e-02  1.948e+03  -2.351 0.018804 *  
Age:Sex1    -1.223e-03  1.440e-03  1.949e+03  -0.849 0.395965    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Age    Sex1  
Age      -0.176              
Sex1      0.018 -0.107       
Age:Sex1 -0.019  0.115 -0.956

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: IR ~ Age * Sex + (1 | Text) + (1 | Family/Language)
   Data: info.rate.data

REML criterion at convergence: 10792.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2256 -0.6274  0.0068  0.5901  4.6601 

Random effects:
 Groups          Name        Variance Std.Dev.
 Text            (Intercept)  0.4338  0.6586  
 Language:Family (Intercept)  8.4982  2.9152  
 Family          (Intercept)  2.1402  1.4629  
 Residual                    12.9843  3.6034  
Number of obs: 1979, groups:  Text, 15; Language:Family, 14; Family, 9

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  3.996e+01  1.009e+00  1.049e+01  39.596 9.13e-13 ***
Age         -3.985e-02  9.517e-03  1.954e+03  -4.187 2.95e-05 ***
Sex1        -6.705e-01  2.756e-01  1.950e+03  -2.433   0.0151 *  
Age:Sex1    -7.286e-03  8.615e-03  1.950e+03  -0.846   0.3978    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Age    Sex1  
Age      -0.285              
Sex1      0.028 -0.106       
Age:Sex1 -0.030  0.114 -0.956

So, it seems Age and Sex are both worth including in our models (even if we have to discard quite a bit of data because of missing Age info). (In fact, the effect of Age seems more significant than that of Sex.)

In the following, we investigate if Age does matter when using GAMLSS modelling…

GAMLSS with Age

Because there are missing data for Age, and because GAMLSS models require complete data, we fit the models with Age (and its interaction with Sex) on the subset of the data that contains only those speakers with Age info. To make the comparison possible, we also fit the same models without Age on exactly the same subset of the data.
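The subsetting step can be illustrated with a minimal sketch. The rows, column names and values below are hypothetical and are not taken from InfoRateData.csv; the point is only that the models with and without Age are compared on the same complete-case rows.

```python
# Hypothetical rows; None marks a speaker without Age info.
rows = [
    {"Speaker": "s1", "Sex": "F", "Age": 23, "SR": 6.9},
    {"Speaker": "s2", "Sex": "M", "Age": None, "SR": 7.2},
    {"Speaker": "s3", "Sex": "M", "Age": 41, "SR": 6.4},
    {"Speaker": "s4", "Sex": "F", "Age": None, "SR": 6.1},
]


def complete_cases(data, column):
    """Keep only the rows in which `column` is not missing."""
    return [row for row in data if row[column] is not None]


# Every model (with or without Age) would be fitted on this same subset,
# so that their fit indices refer to identical observations.
subset_for_age = complete_cases(rows, "Age")

predictors_with_age = ["Sex", "Age"]
predictors_without_age = ["Sex"]

print(len(rows), "rows in total,", len(subset_for_age), "rows with Age info")
```

Fitting the Age-free model on the full data instead would make its deviance and AIC incomparable with those of the Age models, because they would be computed on different observations.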

SR

Fixed or modelled σ (variance)
Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  -8.252732e-05 
                       variance   =  1.000506 
               coef. of skewness  =  0.04585 
               coef. of kurtosis  =  3.821224 
Filliben correlation coefficient  =  0.9973638 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),  
    i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.7419779  0.0234704 287.254  < 2e-16 ***
Sex1        -0.0916878  0.0234704  -3.907 9.71e-05 ***
Age         -0.0024831  0.0007319  -3.393 0.000707 ***
Sex1:Age    -0.0017244  0.0007319  -2.356 0.018566 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.1605     0.0159  -73.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  161.6924
      Residual Deg. of Freedom:  1817.308 
                      at cycle:  35 
 
Global Deviance:     1022.798 
            AIC:     1346.183 
            SBC:     2250.099 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 1022.8 1346.2
Age + Sex 1022.8 1344.2
Sex 1022.8 1342.2

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted here…
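As a check on how the fit indices in the summary above relate to each other, the sketch below recomputes AIC and SBC from the reported global deviance, degrees of freedom and number of observations, using the usual penalised-deviance definitions (penalty 2 per degree of freedom for AIC, log(n) for SBC). The function names are illustrative only.

```python
import math


def aic(deviance, df):
    """Akaike information criterion: deviance plus 2 per degree of freedom."""
    return deviance + 2.0 * df


def sbc(deviance, df, n_obs):
    """Schwarz Bayesian criterion: deviance plus log(n) per degree of freedom."""
    return deviance + math.log(n_obs) * df


# Values copied from the fixed-sigma SR summary with Age * Sex.
deviance, df, n_obs = 1022.798, 161.6924, 1979

aic_full = aic(deviance, df)
sbc_full = sbc(deviance, df, n_obs)
print(round(aic_full, 3), round(sbc_full, 3))
```

Because the three compared models have the same deviance to one decimal place (1022.8), their AIC values differ only through the penalty, which is why each dropped term lowers the AIC by about 2.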

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002353284 
                       variance   =  1.000499 
               coef. of skewness  =  0.02299299 
               coef. of kurtosis  =  2.90592 
Filliben correlation coefficient  =  0.9993823 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      sigma.formula = ~1 + Sex * Age + random(Text) +          random(Language) + random(Family) + random(Speaker),  
    family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),      i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.5465239  0.0203601 321.536  < 2e-16 ***
Sex1        -0.0442321  0.0203601  -2.172     0.03 *  
Age          0.0038165  0.0006483   5.887 4.71e-09 ***
Sex1:Age    -0.0030913  0.0006483  -4.768 2.01e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.330351   0.053970 -24.650   <2e-16 ***
Sex1         0.019846   0.053970   0.368   0.7131    
Age          0.003029   0.001685   1.798   0.0724 .  
Sex1:Age    -0.002561   0.001685  -1.520   0.1286    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  243.8639
      Residual Deg. of Freedom:  1735.136 
                      at cycle:  10 
 
Global Deviance:     721.7861 
            AIC:     1209.514 
            SBC:     2572.798 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 721.8 1209.5
Age + Sex 721.5 1206.1
Sex 721.3 1203

So, even though in this fit Age has a significant effect on μ (here with a positive estimate) and a significant interaction with Sex, while neither reaches significance for σ, adding it does not seem to be warranted here either…

The residuals are less heteroscedastic than before and the fit to the data is better.
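Since the σ submodel uses a log link, the reported Sigma coefficients are on the log scale. The sketch below back-transforms the two SR intercepts shown above; for the modelled-σ fit the result is only the baseline value, on the assumption that all other fixed and random terms in the σ formula are at zero.

```python
import math


def sigma_from_log_link(linear_predictor):
    """Invert the log link: sigma = exp(linear predictor)."""
    return math.exp(linear_predictor)


# Intercepts copied from the Sigma Coefficients tables for SR.
sigma_fixed = sigma_from_log_link(-1.1605)        # fixed-sigma model
sigma_baseline = sigma_from_log_link(-1.330351)   # modelled sigma, baseline only

print(round(sigma_fixed, 4), round(sigma_baseline, 4))
```

A positive coefficient in the σ table therefore multiplies the residual standard deviation by a factor above 1, and a negative one by a factor below 1.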

Summary

Thus, for SR, even if there is a hint that Age might affect it negatively (and there might also be an interaction with Sex with a positive effect for males), overall, the various fit indices do not warrant its inclusion in the GAMLSS models.

IR

Fixed or modelled σ (variance)
Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  1.384527e-05 
                       variance   =  1.000506 
               coef. of skewness  =  0.09166289 
               coef. of kurtosis  =  3.764844 
Filliben correlation coefficient  =  0.9974381 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),  
    i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.015135   0.137507 291.003  < 2e-16 ***
Sex1        -0.175295   0.137508  -1.275    0.203    
Age         -0.035742   0.004288  -8.336  < 2e-16 ***
Sex1:Age    -0.023187   0.004288  -5.408 7.22e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6074     0.0159   38.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  161.827
      Residual Deg. of Freedom:  1817.173 
                      at cycle:  2 
 
Global Deviance:     8020.303 
            AIC:     8343.957 
            SBC:     9248.626 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 8020.3 8344
Age + Sex 8020.3 8342
Sex 8020.3 8340

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted; interestingly, in this case the main effect of Sex disappears.

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002387788 
                       variance   =  1.00051 
               coef. of skewness  =  0.02686615 
               coef. of kurtosis  =  2.9163 
Filliben correlation coefficient  =  0.9993496 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      sigma.formula = ~1 + Sex * Age + random(Text) +          random(Language) + random(Family) + random(Speaker),  
    family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),      i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.041563   0.118455 338.031  < 2e-16 ***
Sex1        -0.299328   0.118455  -2.527   0.0116 *  
Age         -0.036894   0.003708  -9.950  < 2e-16 ***
Sex1:Age    -0.019096   0.003708  -5.150  2.9e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.479445   0.053990   8.880   <2e-16 ***
Sex1         0.025025   0.053990   0.464    0.643    
Age          0.001854   0.001685   1.100    0.271    
Sex1:Age    -0.002747   0.001685  -1.630    0.103    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  244.3689
      Residual Deg. of Freedom:  1734.631 
                      at cycle:  10 
 
Global Deviance:     7726.653 
            AIC:     8215.391 
            SBC:     9581.498 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 7726.7 8215.4
Age + Sex 7726.4 8212.2
Sex 7725.8 8208.8

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), and the main effect of Sex remains significant in this case, adding it does not seem to be warranted here either…

The residuals are less heteroscedastic than before and the fit to the data is better.

Summary

Thus, while for IR the hint that Age has a negative main effect and interacts with Sex (with a positive effect for males, containing the whole effect of Sex) is much stronger, the various fit indices do not warrant its inclusion in the GAMLSS models.

Modelling the relationship between SR and ID

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001744301 
                       variance   =  1.000502 
               coef. of skewness  =  0.008063776 
               coef. of kurtosis  =  2.834364 
Filliben correlation coefficient  =  0.9993747 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID + Sex * Age + random(Text) +      random(Speaker) + random(Family), sigma.formula = ~1 +      ID + Sex * Age + random(Text) + random(Speaker) +      random(Family), family = NO(mu.link = "identity"),  
    data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 12.6793628  0.0517065  245.218  < 2e-16 ***
ID          -0.9813533  0.0071683 -136.903  < 2e-16 ***
Sex1         0.0102517  0.0204390    0.502    0.616    
Age         -0.0059906  0.0006663   -8.991  < 2e-16 ***
Sex1:Age    -0.0050056  0.0006537   -7.658 3.12e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.790248   0.133246  -5.931 3.63e-09 ***
ID          -0.086078   0.019128  -4.500 7.24e-06 ***
Sex1         0.046494   0.054368   0.855   0.3926    
Age          0.001961   0.001718   1.141   0.2540    
Sex1:Age    -0.003459   0.001698  -2.037   0.0418 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  236.269
      Residual Deg. of Freedom:  1742.731 
                      at cycle:  8 
 
Global Deviance:     705.2126 
            AIC:     1177.751 
            SBC:     2498.577 
******************************************************************

The compared models are:

Model Deviance AIC
ID * Sex * Age 705.2 1188.4
ID + Sex * Age 705.2 1177.8
ID * Sex + Age 704.2 1178.2
ID + Sex + Age 704.5 1174.5
ID * Sex 704.4 1174.6
ID + Sex 704.6 1170.9

Clearly, adding Age is not warranted here (nor is the interaction between ID and Sex)…
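The ranking behind this conclusion can be made explicit with a short sketch that takes the AIC values from the table above and reports each model's difference from the best one (ΔAIC). The dictionary simply restates the table; nothing is refitted.

```python
# AIC values copied from the GAMLSS comparison table above.
aic_by_model = {
    "ID * Sex * Age": 1188.4,
    "ID + Sex * Age": 1177.8,
    "ID * Sex + Age": 1178.2,
    "ID + Sex + Age": 1174.5,
    "ID * Sex": 1174.6,
    "ID + Sex": 1170.9,
}

# The preferred model is the one with the smallest AIC.
best_model = min(aic_by_model, key=aic_by_model.get)

# Differences from the best model, sorted from best to worst.
delta_aic = {
    model: round(value - aic_by_model[best_model], 1)
    for model, value in sorted(aic_by_model.items(), key=lambda item: item[1])
}

for model, delta in delta_aic.items():
    print(f"{model:16s} dAIC = {delta}")
```

Every model containing Age, and every model containing the ID by Sex interaction, sits above the simplest ID + Sex model in this ordering.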

As above, we also looked at the simple lmer model:

The compared models are:

Model AIC
ID * Sex * Age 1811.2
ID + Sex * Age 1790.7
ID * Sex + Age 1786.3
ID + Sex + Age 1781
ID * Sex 1777.8
ID + Sex 1772.5

The best model is still the one not including Age:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ 1 + ID + Sex + (1 | Text) + (1 | Speaker) + (1 | Family)
   Data: info.rate.data.for.age

REML criterion at convergence: 1758.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.8062 -0.6300  0.0174  0.5882  5.2084 

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.35876  0.5990  
 Text     (Intercept) 0.01494  0.1222  
 Family   (Intercept) 0.26668  0.5164  
 Residual             0.10575  0.3252  
Number of obs: 1979, groups:  Speaker, 132; Text, 15; Family, 9

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)  10.80777    0.77185  27.12524  14.002 6.21e-14 ***
ID           -0.70707    0.12477  31.07295  -5.667 3.15e-06 ***
Sex1         -0.14683    0.05268 119.92464  -2.787  0.00618 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr) ID    
ID   -0.971       
Sex1 -0.001  0.003

Conclusions about Age

Within the limits of our reduced dataset (containing only 132 speakers with Age info), we found the following:

When modelling SR and IR with GAMLSS, while there are hints that Age has, for both, overall:

  • a negative main effect (i.e., older speakers of both sexes are slower and transmit less information per unit time across texts and languages), and
  • an interaction with Sex (positive for males, i.e., males are less negatively affected by age than females across texts and languages),

it does not seem warranted to include it in these models.

When modelling the relationship between SR and ID, this negative relationship:

  • seems strengthened with increasing Age (i.e., older speakers of both sexes show a stronger negative relationship between SR and ID), and
  • shows an interaction with Sex (for females, this effect of Age is stronger than for males),

but, alas, the inclusion of Age is not warranted in the GAMLSS model, nor (really) in the simpler LMER model.

Thus, while Age seems to negatively influence (in a sex-dependent manner) both SR and IR, as well as to strengthen the negative relationship between SR and ID, its effects are far from clear in the current dataset.

Is there any relationship between ID, Age and Sex?

Here we test the hypothesis that ID is confounded by Age and Sex structure between languages:

Data: info.rate.data.for.age
Models:
model.ID.age: ID ~ Age + (1 | Family)
model.ID.age.sex: ID ~ 1 + Sex * Age + (1 | Family)
                 Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model.ID.age      4 1089.1 1111.5 -540.57   1081.1                         
model.ID.age.sex  6 1091.9 1125.5 -539.96   1079.9 1.2113      2     0.5457
Data: info.rate.data.for.age
Models:
model.ID: ID ~ (1 | Family)
model.ID.age.sex: ID ~ 1 + Sex * Age + (1 | Family)
                 Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
model.ID          3 1088.2 1105.0 -541.09   1082.2                         
model.ID.age.sex  6 1091.9 1125.5 -539.96   1079.9 2.2555      3     0.5211
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: ID ~ 1 + Sex * Age + (1 | Family)
   Data: info.rate.data.for.age

REML criterion at convergence: 1113.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.17017 -0.67570 -0.01297  0.05397  2.96861 

Random effects:
 Groups   Name        Variance Std.Dev.
 Family   (Intercept) 1.14186  1.0686  
 Residual             0.09778  0.3127  
Number of obs: 1979, groups:  Family, 9

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  5.992e+00  3.571e-01  8.071e+00  16.778 1.46e-07 ***
Sex1        -2.587e-02  2.361e-02  1.967e+03  -1.096    0.273    
Age          9.073e-04  8.119e-04  1.967e+03   1.118    0.264    
Sex1:Age     7.923e-04  7.370e-04  1.967e+03   1.075    0.282    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Sex1   Age   
Sex1      0.005              
Age      -0.068 -0.087       
Sex1:Age -0.005 -0.955  0.095

Thus, there does not seem to be any relationship between ID, Age and Sex.
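The p-values of the two likelihood-ratio comparisons above can be checked by hand from the reported Chisq statistics and their degrees of freedom. The sketch below uses the closed-form upper-tail probabilities of the chi-square distribution for 2 and 3 degrees of freedom only; it is a verification aid, not a general chi-square routine.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail chi-square probability, closed form for df = 2 or 3 only."""
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("this sketch only covers df = 2 and df = 3")


# Chisq and Chi Df copied from the two model comparisons above.
p_age_vs_age_sex = chi2_upper_tail(1.2113, 2)   # ID ~ Age  vs  ID ~ Sex * Age
p_null_vs_age_sex = chi2_upper_tail(2.2555, 3)  # ID ~ 1    vs  ID ~ Sex * Age

print(round(p_age_vs_age_sex, 4), round(p_null_vs_age_sex, 4))
```

Both probabilities are far above conventional thresholds, in line with the conclusion that Sex and Age do not improve the model of ID.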

Relationship between ID and the syntagmatic density of information ratio (SDIR)

In our case, SDIR (relative to Vietnamese) reduces to the ratio of the number of syllables (NS) in Vietnamese to the NS in language L, computed separately for each Text; we also denote it here as NSVR (from “NS Vietnamese Ratio”).
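A minimal sketch of this ratio is given below. The two Catalan counts (CAT, texts O1 and O2) are taken from the NS table earlier in this document; the Vietnamese counts and the label "VIE" are placeholders chosen purely for illustration and are not the real values.

```python
# NS per (language, text). CAT values come from the NS table in this document;
# the "VIE" values are hypothetical placeholders.
ns = {
    ("VIE", "O1"): 100,
    ("VIE", "O2"): 130,
    ("CAT", "O1"): 88,
    ("CAT", "O2"): 118,
}


def nsvr(ns_table, language, text, reference="VIE"):
    """NS of the reference language divided by NS of `language`, for one text."""
    return ns_table[(reference, text)] / ns_table[(language, text)]


for text in ("O1", "O2"):
    print(text, round(nsvr(ns, "CAT", text), 3))
```

By construction the ratio equals 1 for the reference language itself, and it exceeds 1 for any language that needs fewer syllables than the reference to express the same text.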

SDIR separately for each Text and Language

Syntagmatic density of information ratio SDIR (relative to Vietnamese) versus ID with LOESS smoother (black) and linear regression (yellow) and their 95%CIs.

The “flat” correlations (Pearson and Spearman) between SDIR and ID are:


    Pearson's product-moment correlation

data:  d$NSVR and d$ID
t = 13.46, df = 253, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5682215 0.7122938
sample estimates:
      cor 
0.6459739 

    Spearman's rank correlation rho

data:  d$NSVR and d$ID
S = 1108400, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.5989002 

The multi-level regression of SDIR on ID (with Text as random effect) and the Text’s ICC are:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: NSVR ~ ID + (1 | Text)
   Data: d

REML criterion at convergence: -394.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.1594 -0.6786 -0.0734  0.5435  3.8668 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.013819 0.11755 
 Residual             0.009879 0.09939 
Number of obs: 255, groups:  Text, 15

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  -0.124714   0.052989  98.407896  -2.354   0.0206 *  
ID            0.146209   0.007138 239.000000  20.484   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
ID -0.811

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: NSVR ~ ID + (1 | Text)

  ICC (Text): 0.5831
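The reported ICC follows directly from the two variance components in the random-effects table above: it is the share of the (non-fixed-effect) variance that is attributable to Text. A small sketch of that arithmetic:

```python
def intraclass_correlation(group_variance, residual_variance):
    """Share of variance attributable to the grouping factor."""
    return group_variance / (group_variance + residual_variance)


# Variance components copied from the model NSVR ~ ID + (1 | Text).
icc_text = intraclass_correlation(0.013819, 0.009879)
print(round(icc_text, 4))
```

An ICC above one half means that, once ID is accounted for, differences between texts contribute more to the variation in SDIR than the residual variation within texts.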

Average SDIR across Texts for each Language

If we consider just the average SDIR for each language:

Average syntagmatic density of information ratio SDIR (relative to Vietnamese) versus ID with LOESS smoother (black) and linear regression (yellow) and their 95%CIs.


    Pearson's product-moment correlation

data:  d1$NSVR.mean and d1$ID
t = 8.6199, df = 15, p-value = 3.394e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7683980 0.9682841
sample estimates:
      cor 
0.9121585 

    Spearman's rank correlation rho

data:  d1$NSVR.mean and d1$ID
S = 162.6, p-value = 0.0001126
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8007359 

Call:
lm(formula = NSVR.mean ~ ID, data = d1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.07917 -0.04788 -0.01277  0.02717  0.13245 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.12471    0.10321  -1.208    0.246    
ID           0.14621    0.01696   8.620 3.39e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06098 on 15 degrees of freedom
Multiple R-squared:  0.832, Adjusted R-squared:  0.8208 
F-statistic:  74.3 on 1 and 15 DF,  p-value: 3.394e-07
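The t statistics of the Pearson tests and the R-squared of the simple regression are all functions of the correlation coefficient, which the sketch below makes explicit using the standard relation t = r · sqrt(df) / sqrt(1 − r²). The inputs are the correlations and degrees of freedom reported above.

```python
import math


def t_from_correlation(r, df):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)


# Per-text correlation (255 points) and per-language correlation (17 points).
t_per_text = t_from_correlation(0.6459739, 253)
t_per_language = t_from_correlation(0.9121585, 15)

# With a single predictor, R-squared is the squared correlation.
r_squared_per_language = 0.9121585 ** 2

print(round(t_per_text, 2), round(t_per_language, 4), round(r_squared_per_language, 3))
```

This also shows why the slope's t value in the per-language regression coincides with the t value of the Pearson test: with one predictor the two tests are equivalent.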

Mixture of Gaussians

In what follows, the mixing probabilities do not depend on factors such as Sex.

SR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.631  
Sigma Coefficients for model: 1 
(Intercept)  
     0.1378  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2286 
Global Deviance:     7123.61 
            AIC:     7127.61 
            SBC:     7139.08 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
       5.26  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.4778  
Mu Coefficients for model: 2 
(Intercept)  
       7.15  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.1862  

Estimated probabilities: 0.27457 0.72543 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2283 
Global Deviance:     7001.23 
            AIC:     7011.23 
            SBC:     7039.9 
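To help read the mixture output, the sketch below rebuilds the two-component density from the coefficients printed above. It assumes that the Sigma coefficients are on the log scale (the default σ link of the normal family), so each standard deviation is obtained by exponentiation; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# (mixing probability, mu, sigma) for the two-component SR fit above;
# sigma = exp(coefficient) under the assumed log link.
components = [
    (0.27457, 5.26, math.exp(-0.4778)),
    (0.72543, 7.15, math.exp(-0.1862)),
]


def mixture_pdf(x, parts):
    """Density of a finite Gaussian mixture at x."""
    return sum(p * NormalDist(mu, sd).pdf(x) for p, mu, sd in parts)


# The overall mean of a mixture is the probability-weighted mean of its parts.
mixture_mean = sum(p * mu for p, mu, _ in components)

print(round(mixture_mean, 3), round(mixture_pdf(6.5, components), 3))
```

The weighted mean of the two components should land close to the single-component estimate of 6.631, as expected when both fits describe the same data.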

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.257  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.2162  
Mu Coefficients for model: 2 
(Intercept)  
       6.44  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.1447  
Mu Coefficients for model: 3 
(Intercept)  
      5.199  
Sigma Coefficients for model: 3 
(Intercept)  
    -0.5124  

Estimated probabilities: 0.590421 0.1746332 0.2349458 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2280 
Global Deviance:     7001.95 
            AIC:     7017.95 
            SBC:     7063.84 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.012  
Sigma Coefficients for model: 1 
(Intercept)  
   -0.03502  
Mu Coefficients for model: 2 
(Intercept)  
      5.323  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.4349  
Mu Coefficients for model: 3 
(Intercept)  
      7.257  
Sigma Coefficients for model: 3 
(Intercept)  
   -0.09049  
Mu Coefficients for model: 4 
(Intercept)  
        7.2  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.6349  

Estimated probabilities: 0.231015 0.2886475 0.2650284 0.2153091 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2277 
Global Deviance:     6996.59 
            AIC:     7018.59 
            SBC:     7081.68 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.772  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.1062  
Mu Coefficients for model: 2 
(Intercept)  
      7.611  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.2516  
Mu Coefficients for model: 3 
(Intercept)  
      7.207  
Sigma Coefficients for model: 3 
(Intercept)  
     -1.041  
Mu Coefficients for model: 4 
(Intercept)  
      5.339  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.4062  
Mu Coefficients for model: 5 
(Intercept)  
      6.852  
Sigma Coefficients for model: 5 
(Intercept)  
    -0.1055  

Estimated probabilities: 0.1783584 0.2294359 0.1266083 0.2811023 0.1844951 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2274 
Global Deviance:     6988.48 
            AIC:     7016.48 
            SBC:     7096.78 

Comparing AIC

            df      AIC
mix.SR.NO.2  5 7011.225
mix.SR.NO.5 14 7016.484
mix.SR.NO.3  8 7017.954
mix.SR.NO.4 11 7018.591
mix.SR.NO.1  2 7127.609

Showing the distributions

Mixture of Gaussians for SR.

IR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.15  
Sigma Coefficients for model: 1 
(Intercept)  
      1.629  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2286 
Global Deviance:     13945.2 
            AIC:     13949.2 
            SBC:     13960.7 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      41.16  
Sigma Coefficients for model: 1 
(Intercept)  
      1.861  
Mu Coefficients for model: 2 
(Intercept)  
      38.38  
Sigma Coefficients for model: 2 
(Intercept)  
      1.442  

Estimated probabilities: 0.2770978 0.7229022 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2283 
Global Deviance:     13895 
            AIC:     13905 
            SBC:     13933.7 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      40.01  
Sigma Coefficients for model: 1 
(Intercept)  
     0.9221  
Mu Coefficients for model: 2 
(Intercept)  
      42.61  
Sigma Coefficients for model: 2 
(Intercept)  
      1.713  
Mu Coefficients for model: 3 
(Intercept)  
      35.75  
Sigma Coefficients for model: 3 
(Intercept)  
      1.374  

Estimated probabilities: 0.2997471 0.3102765 0.3899765 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2280 
Global Deviance:     13875.7 
            AIC:     13891.7 
            SBC:     13937.6 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.59  
Sigma Coefficients for model: 1 
(Intercept)  
     0.7231  
Mu Coefficients for model: 2 
(Intercept)  
      42.84  
Sigma Coefficients for model: 2 
(Intercept)  
      1.772  
Mu Coefficients for model: 3 
(Intercept)  
      40.16  
Sigma Coefficients for model: 3 
(Intercept)  
      1.364  
Mu Coefficients for model: 4 
(Intercept)  
      34.43  
Sigma Coefficients for model: 4 
(Intercept)  
      1.254  

Estimated probabilities: 0.2027138 0.2280114 0.3066644 0.2626103 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2277 
Global Deviance:     13870.5 
            AIC:     13892.5 
            SBC:     13955.6 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      34.41  
Sigma Coefficients for model: 1 
(Intercept)  
      1.261  
Mu Coefficients for model: 2 
(Intercept)  
      43.85  
Sigma Coefficients for model: 2 
(Intercept)  
       1.77  
Mu Coefficients for model: 3 
(Intercept)  
      39.55  
Sigma Coefficients for model: 3 
(Intercept)  
     0.5825  
Mu Coefficients for model: 4 
(Intercept)  
      40.27  
Sigma Coefficients for model: 4 
(Intercept)  
       1.35  
Mu Coefficients for model: 5 
(Intercept)  
      40.07  
Sigma Coefficients for model: 5 
(Intercept)  
      1.376  

Estimated probabilities: 0.2666359 0.1665143 0.1540443 0.2038578 0.2089478 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2274 
Global Deviance:     13868.7 
            AIC:     13896.7 
            SBC:     13977 

Comparing AIC

            df      AIC
mix.IR.NO.3  8 13891.72
mix.IR.NO.4 11 13892.47
mix.IR.NO.5 14 13896.74
mix.IR.NO.2  5 13905.05
mix.IR.NO.1  2 13949.21

Showing the distributions

Mixture of Gaussians for IR.

Tests of unimodality

We used three ways to estimate how unimodal a distribution is, as they tend to disagree and the problem of unimodality testing is far from settled (see Freeman & Dale, 2013):

  • the Silverman test, which tests the null hypothesis that the underlying density has at most k modes (H0: number of modes <= k). The result is the (bootstrapped) p-value for rejecting a unimodal distribution. It is described in Silverman (1981) and Hall & York (2001); our implementation is based on the code available at https://www.mathematik.uni-marburg.de/~stochastik/R_packages/;
  • the dip test, which computes Hartigans’ dip statistic $D_n$ and its p-value for the test of unimodality by interpolating tabulated quantiles of $\sqrt{n} \cdot D_n$. For $X_i \sim F$, i.i.d., the null hypothesis is that F is a unimodal distribution. The result is the D statistic and the (interpolated) p-value for rejecting a unimodal distribution. See Hartigan (1985) and Hartigan & Hartigan (1985); it is implemented in package diptest;
  • the bimodality coefficient (BC), which is based on an empirical relationship between bimodality and the third and fourth statistical moments of a distribution (skewness s and kurtosis k). It is proportional to the ratio of squared skewness to uncorrected kurtosis, $BC \approx (s^2 + 1)/k$, with the underlying logic that a bimodal distribution will have very low kurtosis, an asymmetric character, or both; all of these conditions increase BC. The values range from 0 to 1, with those exceeding 0.555 (the value representing a uniform distribution) suggesting bimodality. The result is the BC estimate (which must exceed 0.555 to reject a unimodal distribution). We implemented it following Freeman & Dale (2013) as $BC = (s^2 + 1) / \left(k + 3 \cdot \frac{(n-1)^2}{(n-2)(n-3)}\right)$.
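
To make the last formula concrete, here is a minimal base-R sketch of the bimodality coefficient. It is not the code used for the results reported here: it assumes plain moment-based estimates of skewness and excess kurtosis, and the function name `bimodality_coefficient` is ours.

```r
bimodality_coefficient <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)
  s  <- mean(d^3) / m2^1.5     # skewness (moment-based; ASSUMPTION)
  k  <- mean(d^4) / m2^2 - 3   # excess kurtosis (moment-based; ASSUMPTION)
  (s^2 + 1) / (k + 3 * (n - 1)^2 / ((n - 2) * (n - 3)))
}

set.seed(1)
unimodal <- rnorm(2000)                          # one Gaussian
bimodal  <- c(rnorm(1000, -3), rnorm(1000, 3))   # two well-separated Gaussians
flat     <- seq(0, 1, length.out = 10001)        # evenly spread values

bc_uni  <- bimodality_coefficient(unimodal)
bc_bi   <- bimodality_coefficient(bimodal)
bc_flat <- bimodality_coefficient(flat)
```

With these toy inputs the Gaussian sample falls below the 0.555 threshold, the two-cluster sample falls above it, and the evenly spread values land very close to it, in line with the interpretation of the threshold given above.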

For each such test, we performed four randomisation procedures to estimate how “special” the observed unimodality estimate is; for each permuted dataset, we recomputed everything before estimating the unimodality of the permuted distribution:

  • Permutation model 1 (PM1): randomly permute the SR values freely among speakers, texts and languages;
  • Permutation model 2 (PM2): randomly permute the ID values among languages;
  • Permutation model 3 (PM3): randomly permute the Speaker average SR values among speakers (irrespective of languages);
  • Permutation model 4 (PM4): randomly permute the Language average SR values among languages.
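
The general logic of these permutation procedures can be sketched as follows for PM1. This is an illustration on hypothetical toy data, not the code that produced the results below: it assumes that IR is recomputed as SR × ID after each permutation, uses a moment-based bimodality coefficient as the unimodality estimate, and counts a permutation as "more unimodal" when its BC is lower than the observed one.

```r
set.seed(42)

# Hypothetical toy data: one row per recording.
toy <- data.frame(
  Language = rep(c("A", "B", "C"), each = 50),
  SR       = c(rnorm(50, 5, 0.5), rnorm(50, 7, 0.5), rnorm(50, 8, 0.5)),
  ID       = rep(c(8, 6, 5), each = 50)
)

# Moment-based bimodality coefficient (ASSUMPTION: excess kurtosis).
bc <- function(x) {
  n <- length(x); d <- x - mean(x); m2 <- mean(d^2)
  s <- mean(d^3) / m2^1.5
  k <- mean(d^4) / m2^2 - 3
  (s^2 + 1) / (k + 3 * (n - 1)^2 / ((n - 2) * (n - 3)))
}

observed <- bc(toy$SR * toy$ID)   # ASSUMPTION: IR = SR * ID

n_perm <- 500
permuted <- replicate(n_perm, {
  sr_perm <- sample(toy$SR)       # PM1: shuffle SR across all recordings
  bc(sr_perm * toy$ID)            # recompute IR, then the statistic
})

# Share of permutations that look more unimodal (lower BC) than observed.
pct_more_unimodal <- 100 * mean(permuted < observed)
```

The other permutation models differ only in what is shuffled (for example, speaker or language averages instead of individual SR values); the comparison of the observed estimate with the permuted distribution stays the same.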

Visual comparison

The observed estimate (vertical blue solid line), the permuted distribution (gray histogram), and the “unimodality region” (shaded green rectangle) are shown below (for PM3, we also show the original estimate using the Speaker average SR as a vertical solid red line).

PM1

Permutation of the texts’ SRs (PM1).

PM2

Permutation of the languages’ ID (PM2).

PM3

Permutation of the speakers’ average SRs (PM3).

PM4

Permutation of the languages’ average SRs with speaker adjustment (PM4).

Summary of unimodality tests

Summary of unimodality permutation tests.
| Scenario | Measure | Test | Observed estimate (p-value) | % more unimodal permutations |
|---|---|---|---|---|
| PM1 | SR | Silverman | - (0.024) | 55.5% |
| PM1 | SR | Dip | 0.005 (0.984) * | 100% |
| PM1 | SR | BC | 0.19 () * | 100% |
| PM1 | IR | Silverman | - (0.835) * | 15.3% |
| PM1 | IR | Dip | 0.005 (0.992) * | 71% |
| PM1 | IR | BC | 0.167 () * | 100% |
| PM2 | SR | Silverman | - (0.024) | 56.4% |
| PM2 | SR | Dip | 0.005 (0.984) * | 100% |
| PM2 | SR | BC | 0.19 () * | 100% |
| PM2 | IR | Silverman | - (0.835) * | 2.7% |
| PM2 | IR | Dip | 0.005 (0.992) * | 25.9% |
| PM2 | IR | BC | 0.167 () * | 97.3% |
| PM3 | SR | Silverman | - (0.024) | 100% |
| PM3 | SR | Dip | 0.005 (0.984) * | 0% |
| PM3 | SR | BC | 0.19 () * | 100% |
| PM3 | IR | Silverman | - (0.835) * | 16.8% |
| PM3 | IR | Dip | 0.005 (0.992) * | 17.5% |
| PM3 | IR | BC | 0.167 () * | 100% |
| PM4 | SR | Silverman | - (0.024) | 4.1% |
| PM4 | SR | Dip | 0.005 (0.984) * | 1.5% |
| PM4 | SR | BC | 0.19 () * | 82.8% |
| PM4 | IR | Silverman | - (0.835) * | 0.7% |
| PM4 | IR | Dip | 0.005 (0.992) * | 13% |
| PM4 | IR | BC | 0.167 () * | 96.8% |

Pair-wise distances between languages

We compute various distances between languages (as implemented by the function distance() in package philentropy) with respect to the distributions of NS, SR and ID.
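
To illustrate what such a pairwise distance measures, the sketch below compares two hypothetical languages in base R. It does not call philentropy and may use different conventions (binning, logarithm base, scaling) from that package: the Jensen-Shannon divergence is computed by hand with natural logarithms on a common 20-bin grid, and the Kolmogorov–Smirnov statistic is taken from ks.test(). All variable and function names are ours.

```r
set.seed(7)

# Hypothetical SR samples for two languages.
lang_a <- rnorm(150, mean = 6, sd = 0.7)
lang_b <- rnorm(150, mean = 7, sd = 0.7)

# Bin both samples on a common grid to obtain discrete distributions.
rng    <- range(lang_a, lang_b)
breaks <- seq(rng[1] - 0.1, rng[2] + 0.1, length.out = 21)
p <- hist(lang_a, breaks = breaks, plot = FALSE)$counts / length(lang_a)
q <- hist(lang_b, breaks = breaks, plot = FALSE)$counts / length(lang_b)

# Jensen-Shannon divergence (natural log): average Kullback-Leibler
# divergence of p and q from their midpoint distribution m.
js_divergence <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

js <- js_divergence(p, q)

# Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs.
ks <- unname(ks.test(lang_a, lang_b)$statistic)
```

Both quantities are 0 for identical distributions and grow as the two languages' distributions overlap less; repeating the computation for every pair of languages yields the distributions of pairwise distances compared below.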

Comparing the distribution of pairwise distances between languages.

Paired permutation t-tests comparing measures with 1,000 permutations.
m1 m2 d mean1 median1 sd1 mean2 median2 sd2 p
IR NS Hellinger 0.88 0.83 0.32 1.20 1.19 0.39 0.00
IR NS Jensen-Shannon 0.17 0.14 0.12 0.29 0.26 0.16 0.00
IR NS Kolmogorov–Smirnov 0.42 0.37 0.20 0.57 0.57 0.23 0.00
IR NS Kullback-Leibler 7.13 4.22 7.81 15.42 13.12 13.31 0.00
IR NS Squared-Chi 0.56 0.45 0.36 0.88 0.79 0.47 0.00
IR SR Hellinger 0.88 0.83 0.32 1.10 1.06 0.48 0.00
IR SR Jensen-Shannon 0.17 0.14 0.12 0.27 0.23 0.20 0.00
IR SR Kolmogorov–Smirnov 0.42 0.37 0.20 0.56 0.55 0.27 0.00
IR SR Kullback-Leibler 7.13 4.22 7.81 12.80 6.69 14.88 0.00
IR SR Squared-Chi 0.56 0.45 0.36 0.86 0.77 0.57 0.00
NS SR Hellinger 1.20 1.19 0.39 1.10 1.06 0.48 0.01
NS SR Jensen-Shannon 0.29 0.26 0.16 0.27 0.23 0.20 0.30
NS SR Kolmogorov–Smirnov 0.57 0.57 0.23 0.56 0.55 0.27 0.81
NS SR Kullback-Leibler 15.42 13.12 13.31 12.80 6.69 14.88 0.05
NS SR Squared-Chi 0.88 0.79 0.47 0.86 0.77 0.57 0.62

References

Campione, E., & Véronis, J. (1998). A multilingual prosodic database. Proc. of the 5th International Conference on Spoken Language Processing (ICSLP’98), Sydney, Australia, 3163-3166.

Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1), 83-97.

Hall, P., & York, M. (2001). On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11, 515-536.

Hartigan, J. A., & Hartigan, P. M. (1985). The Dip Test of Unimodality. Annals of Statistics, 13, 70–84.

Hartigan, P. M. (1985). Computation of the Dip Statistic to Test for Unimodality. Applied Statistics (JRSS C), 34, 320–325.

Le, V. B., Tran, D. D., Castelli, E., Besacier, L., & Serignat, J. F. (2004). Spoken and Written Language Resources for Vietnamese. In LREC. 4, pp. 599-602.

Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H., Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9). Association for Computational Linguistics, Gothenburg, Sweden, 36-43.

New B., Pallier C., Ferrand L., & Matos R. (2001). Une base de données lexicales du français contemporain sur internet: LEXIQUE 3.80, L’Année Psychologique, 101, 447-462. http://www.lexique.org.

Oh, Y. M. (2015). Linguistic complexity and information: quantitative approaches. PhD Thesis, Université de Lyon, France. Retrieved from http://www.afcp-parole.org/doc/theses/these_YMO15.pdf

Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610-615.

Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (Eds.) WaCky! Working papers on the web as corpus, Gedit, Bologna, http://corpus.leeds.ac.uk/queryzh.html.

Silverman, B.W. (1981). Using Kernel Density Estimates to investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43, 97-99.

Váradi, T. (2002). The Hungarian National Corpus. In LREC.

Zséder, A., Recski, G., Varga, D., & Kornai, A. (2012). Rapid creation of large-scale corpora and frequency dictionaries. In Proceedings to LREC 2012.

Appendix I: R session info

This document was compiled on:

R version 3.4.4 (2018-03-15)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: grid, parallel, splines, stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: broman(v.0.69-5), philentropy(v.0.3.0), pander(v.0.6.3), moments(v.0.14), sjPlot(v.2.6.3), sjstats(v.0.17.4), gamlss.mx(v.4.3-5), nnet(v.7.3-12), gamlss(v.5.1-3), nlme(v.3.1-139), gamlss.dist(v.5.1-3), MASS(v.7.3-51.4), gamlss.data(v.5.1-3), lmerTest(v.3.1-0), lme4(v.1.1-21), Matrix(v.1.2-17), plyr(v.1.8.4), reshape2(v.1.4.3), ggrepel(v.0.8.0), ggplot2(v.3.1.1) and RhpcBLASctl(v.0.18-205)

loaded via a namespace (and not attached): tidyr(v.0.8.3), modelr(v.0.1.4), assertthat(v.0.2.1), highr(v.0.8), yaml(v.2.2.0), bayestestR(v.0.1.0), numDeriv(v.2016.8-1), pillar(v.1.3.1), backports(v.1.1.4), lattice(v.0.20-38), glue(v.1.3.1), digest(v.0.6.18), glmmTMB(v.0.2.3), minqa(v.1.2.4), colorspace(v.1.4-1), sandwich(v.2.5-1), htmltools(v.0.3.6), psych(v.1.8.12), pkgconfig(v.2.0.2), broom(v.0.5.2), haven(v.2.1.0), purrr(v.0.3.2), xtable(v.1.8-4), mvtnorm(v.1.0-8), scales(v.1.0.0), emmeans(v.1.3.4), tibble(v.2.1.1), generics(v.0.0.2), sjlabelled(v.1.0.17), TH.data(v.1.0-10), withr(v.2.1.2), TMB(v.1.7.15), lazyeval(v.0.2.2), mnormt(v.1.5-5), survival(v.2.44-1.1), magrittr(v.1.5), crayon(v.1.3.4), estimability(v.1.3), evaluate(v.0.13), foreign(v.0.8-71), forcats(v.0.4.0), tools(v.3.4.4), hms(v.0.4.2), multcomp(v.1.4-8), stringr(v.1.4.0), munsell(v.0.5.0), ggeffects(v.0.9.0), compiler(v.3.4.4), rlang(v.0.3.4), nloptr(v.1.2.1), labeling(v.0.3), rmarkdown(v.1.12), boot(v.1.3-22), gtable(v.0.3.0), codetools(v.0.2-16), sjmisc(v.2.7.9), R6(v.2.4.0), zoo(v.1.8-5), knitr(v.1.22), dplyr(v.0.8.0.1), performance(v.0.1.0), insight(v.0.2.0), stringi(v.1.4.3), Rcpp(v.1.0.1), tidyselect(v.0.2.5), xfun(v.0.6) and coda(v.0.19-2)

Appendix II: Speech Rate: Canonical vs. automatic detection

Here we compare the Speech Rate (SR) used in the paper, defined as the canonical articulatory rate (the number of syllables in the canonical pronunciation of the text per second of speech), with an estimate of the realized speech rate based on the automatic detection of syllable nuclei, as implemented by the popular algorithm described in De Jong, N. H., & Wempe, T. (2009). Praat script to detect syllable nuclei and measure speech rate automatically. Behavior Research Methods, 41(2), 385-390. We used their algorithm with standard parameters (except for pauses, which were retrieved from our manual annotation to match the main analysis) on the actual oral productions of our speakers. The results are available in the TAB-separated CSV file AutomaticSylDetect.csv, which has the following structure:

  • soundname: combination of Language, Text and Speaker info
  • nsyll: number of syllables
  • npause: number of pauses
  • dur: total duration
  • phonationtime: actual duration
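
As a rough illustration of how rates can be derived from these columns, the sketch below builds a small hypothetical table with the same column names and computes syllables per second with and without pauses. The values and `soundname` labels are made up, durations are assumed to be in seconds, and the exact derivation of the automatic SR used in the comparisons below is not shown here.

```r
# Hypothetical rows mimicking the documented columns; the real file could be
# read with: read.delim("AutomaticSylDetect.csv")
auto <- data.frame(
  soundname     = c("rec1", "rec2", "rec3"),   # placeholder labels
  nsyll         = c(95, 120, 88),              # detected syllable nuclei
  npause        = c(3, 5, 2),                  # number of pauses
  dur           = c(22.4, 30.1, 19.8),         # total duration (ASSUMED seconds)
  phonationtime = c(19.6, 25.3, 17.9)          # duration excluding pauses
)

# Syllables per second over the whole recording (pauses included) ...
auto$rate_incl_pauses <- auto$nsyll / auto$dur
# ... and over phonation time only (pauses excluded).
auto$rate_excl_pauses <- auto$nsyll / auto$phonationtime

print(auto[, c("soundname", "rate_incl_pauses", "rate_excl_pauses")])
```

Because phonation time can never exceed total duration, the rate that excludes pauses is always at least as high as the one that includes them.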

NS (# of syllables):

Pearson’s and Spearman’s correlations and paired t-test:


    Pearson's product-moment correlation

data:  syll.data$NS and syll.data$NS.auto
t = 57.58, df = 2286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7520760 0.7855572
sample estimates:
      cor 
0.7693444 

    Spearman's rank correlation rho

data:  syll.data$NS and syll.data$NS.auto
S = 455450000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7718458 

    Paired t-test

data:  syll.data$NS and syll.data$NS.auto
t = 59.48, df = 2287, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 19.08661 20.38804
sample estimates:
mean of the differences 
               19.73733 

NS: canonical (x axis) vs automatic (y axis) overall (black) and separately by language (colored).

NS: canonical (x axis) vs automatic (y axis) overall (black) and separately by text (colored).

NS: canonical (x axis) vs automatic (y axis) separately by text and language.
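
The three tests reported above for NS use standard base-R functions. The following minimal sketch runs them on hypothetical paired counts in which automatic detection finds about 20 fewer syllables than the canonical count, roughly mirroring the mean difference reported above; the vectors `ns_canonical` and `ns_auto` are simulated and are not the study data.

```r
set.seed(3)

# Hypothetical paired syllable counts for 200 recordings.
ns_canonical <- round(runif(200, 80, 150))
ns_auto      <- round(ns_canonical - 20 + rnorm(200, sd = 10))

# Linear association between the two counts.
pearson <- cor.test(ns_canonical, ns_auto, method = "pearson")

# Rank-based association; exact = FALSE because rounded counts contain ties.
spearman <- cor.test(ns_canonical, ns_auto, method = "spearman", exact = FALSE)

# Systematic difference between the paired counts (canonical minus automatic).
paired <- t.test(ns_canonical, ns_auto, paired = TRUE)

print(c(r = unname(pearson$estimate),
        rho = unname(spearman$estimate),
        mean_diff = unname(paired$estimate)))
```

The correlations and the paired t-test answer different questions: two measures can be strongly correlated and still differ systematically, which is the pattern seen for NS above.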

SR (speech rate):

Pearson’s and Spearman’s correlations and paired t-test:


    Pearson's product-moment correlation

data:  syll.data$SR and syll.data$SR.auto
t = 27.135, df = 2286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4619458 0.5239624
sample estimates:
      cor 
0.4935813 

    Spearman's rank correlation rho

data:  syll.data$SR and syll.data$SR.auto
S = 1059600000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4691952 

    Paired t-test

data:  syll.data$SR and syll.data$SR.auto
t = 64.226, df = 2287, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.302039 1.384054
sample estimates:
mean of the differences 
               1.343047 

SR: canonical (x axis) vs automatic (y axis) overall (black) and separately by language (colored).

SR: canonical (x axis) vs automatic (y axis) overall (black) and separately by text (colored).

SR: canonical (x axis) vs automatic (y axis) separately by text and language.

SR: canonical (x axis) vs automatic (y axis) separately by speaker.

Canonical and automatic SR by language

Plot, linear (mixed-effects) regression, correlation and paired t-tests:

SR: canonical (x axis) vs automatic (y axis) separately by language with regression line (black) and LOESS smoothing (yellow) and their 95%CIs.

Across languages:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Family/Language) + (1 | Text) + (1 | Speaker)
   Data: syll.data

REML criterion at convergence: 1044.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4153 -0.6165 -0.0139  0.6006  3.4939 

Random effects:
 Groups          Name        Variance  Std.Dev. 
 Speaker         (Intercept) 8.152e-02 2.855e-01
 Language:Family (Intercept) 4.036e-02 2.009e-01
 Text            (Intercept) 1.510e-03 3.886e-02
 Family          (Intercept) 1.542e-10 1.242e-05
 Residual                    7.367e-02 2.714e-01
Number of obs: 2288, groups:  Speaker, 170; Language:Family, 17; Text, 15; Family, 9

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept) 3.657e+00  1.167e-01 2.102e+02   31.34   <2e-16 ***
SR          2.425e-01  1.556e-02 1.269e+03   15.58   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.883
convergence code: 0
boundary (singular) fit: see ?isSingular

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Family/Language) + (1 | Text) + (1 | Speaker)

          ICC (Speaker): 0.4137
  ICC (Language:Family): 0.2048
             ICC (Text): 0.0077
           ICC (Family): 0.0000
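
The intraclass correlation coefficients above can be recovered directly from the variance components of the across-languages model: each ICC is the variance of one grouping factor divided by the total variance (all random effects plus the residual). A minimal sketch using the printed values follows; the vector name `vc` is ours.

```r
# Variance components copied from the "Random effects" table printed above.
vc <- c(
  "Speaker"         = 8.152e-02,
  "Language:Family" = 4.036e-02,
  "Text"            = 1.510e-03,
  "Family"          = 1.542e-10,
  "Residual"        = 7.367e-02
)

# ICC of each grouping factor: its share of the total variance.
icc <- vc[names(vc) != "Residual"] / sum(vc)

print(round(icc, 4))
```

The Family variance is estimated at essentially zero, which is what the "boundary (singular) fit" message in the model output flags.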

For each language separately:


For *CAT*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 31.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.0701 -0.6506 -0.0219  0.4436  3.1370 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.01294  0.1137  
 Speaker  (Intercept) 0.11321  0.3365  
 Residual             0.04915  0.2217  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)   3.6072     0.4748 141.6269   7.598 3.74e-12 ***
SR            0.2751     0.0653 146.8828   4.213 4.38e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.972

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0738
  ICC (Speaker): 0.6458


For *CMN*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 70.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.83050 -0.69715 -0.01563  0.77666  2.44202 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.009625 0.09811 
 Speaker  (Intercept) 0.076741 0.27702 
 Residual             0.069251 0.26316 
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  3.30672    0.45369 76.44565   7.289 2.43e-10 ***
SR           0.31115    0.07578 83.31162   4.106 9.36e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.978

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0619
  ICC (Speaker): 0.4931


For *DEU*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 30.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.1814 -0.5561  0.1037  0.4896  1.7387 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.008901 0.09435 
 Speaker  (Intercept) 0.049761 0.22307 
 Residual             0.056883 0.23850 
Number of obs: 75, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  3.27935    0.40582 19.85609   8.081 1.05e-07 ***
SR           0.25498    0.06536 20.50531   3.901 0.000854 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.980

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0770
  ICC (Speaker): 0.4307


For *ENG*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 16

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.58999 -0.59293  0.02286  0.47987  2.70002 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.025424 0.1594  
 Speaker  (Intercept) 0.001998 0.0447  
 Residual             0.051065 0.2260  
Number of obs: 60, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)   2.8381     0.3514 8.2792   8.076 3.35e-05 ***
SR            0.3293     0.0547 8.4683   6.021 0.000252 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.989

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.3239
  ICC (Speaker): 0.0255


For *EUS*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 88.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.5915 -0.6106 -0.0515  0.5333  2.1938 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.02698  0.1643  
 Speaker  (Intercept) 0.05815  0.2411  
 Residual             0.07462  0.2732  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  3.45341    0.47576 68.91222   7.259 4.54e-10 ***
SR           0.29879    0.06196 74.49606   4.823 7.31e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.982
convergence code: 0
Model failed to converge with max|grad| = 0.00271745 (tol = 0.002, component 1)


Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.1689
  ICC (Speaker): 0.3640


For *FIN*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 94

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.77451 -0.71674 -0.06606  0.55614  2.37964 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.02740  0.1655  
 Speaker  (Intercept) 0.08983  0.2997  
 Residual             0.07605  0.2758  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   4.12836    0.55589 103.30284   7.427  3.3e-11 ***
SR            0.16954    0.07608 110.66493   2.228   0.0279 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.982

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.1417
  ICC (Speaker): 0.4648


For *FRA*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 70.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.00987 -0.62700  0.09629  0.62996  3.09995 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.02005  0.1416  
 Speaker  (Intercept) 0.05159  0.2271  
 Residual             0.06725  0.2593  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  3.84709    0.45106 83.05359   8.529  5.7e-13 ***
SR           0.25609    0.06447 89.52475   3.972 0.000144 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.983

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.1444
  ICC (Speaker): 0.3715


For *HUN*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: -8.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2830 -0.5859 -0.0511  0.6111  2.8044 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.001638 0.04047 
 Speaker  (Intercept) 0.055970 0.23658 
 Residual             0.042307 0.20569 
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  2.80958    0.38240 44.61013   7.347 3.28e-09 ***
SR           0.40933    0.06382 46.94630   6.414 6.36e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.979

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0164
  ICC (Speaker): 0.5602


For *ITA*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: -3.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.84258 -0.54979 -0.02642  0.70407  1.58134 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.00000  0.0000  
 Speaker  (Intercept) 0.04278  0.2068  
 Residual             0.03439  0.1854  
Number of obs: 54, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  2.25739    0.38812 16.47267   5.816 2.34e-05 ***
SR           0.42568    0.05386 17.06472   7.903 4.20e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.983
convergence code: 0
boundary (singular) fit: see ?isSingular


Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0000
  ICC (Speaker): 0.5543


For *JPN*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 53.3

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.52220 -0.60009 -0.04425  0.55847  2.53058 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.04229  0.2056  
 Speaker  (Intercept) 0.06345  0.2519  
 Residual             0.05458  0.2336  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)   3.5877     0.6604 105.4468   5.432  3.6e-07 ***
SR            0.1836     0.0813 110.3256   2.258   0.0259 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.989

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.2638
  ICC (Speaker): 0.3958


For *KOR*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 36.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.18322 -0.60061 -0.05576  0.67550  3.04396 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.02043  0.1429  
 Speaker  (Intercept) 0.07645  0.2765  
 Residual             0.05048  0.2247  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  2.91873    0.44492 78.41047   6.560 5.25e-09 ***
SR           0.34066    0.06101 88.83628   5.584 2.53e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.976

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.1387
  ICC (Speaker): 0.5188


For *SPA*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 63.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.34549 -0.53648 -0.00435  0.60902  2.43317 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.01698  0.1303  
 Speaker  (Intercept) 0.18965  0.4355  
 Residual             0.05947  0.2439  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   4.28465    0.54626 121.79396   7.844 1.88e-12 ***
SR            0.19527    0.06823 118.13840   2.862  0.00498 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.965

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0638
  ICC (Speaker): 0.7127


For *SRP*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 48.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.8756 -0.6255 -0.1238  0.6560  2.1764 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.03592  0.1895  
 Speaker  (Intercept) 0.06958  0.2638  
 Residual             0.05301  0.2302  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  3.69692    0.49182 78.87158   7.517 7.63e-11 ***
SR           0.22437    0.06736 86.05809   3.331  0.00128 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.980

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.2266
  ICC (Speaker): 0.4390


For *THA*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 31.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.65271 -0.62596 -0.00861  0.63282  2.20438 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.00178  0.04219 
 Speaker  (Intercept) 0.07571  0.27515 
 Residual             0.05556  0.23571 
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  3.17088    0.35727 91.31262   8.875 5.61e-14 ***
SR           0.37101    0.07357 99.07802   5.043 2.07e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.968

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0134
  ICC (Speaker): 0.5690


For *TUR*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 48.5

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.12994 -0.62360 -0.02388  0.60274  2.19757 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.02108  0.1452  
 Speaker  (Intercept) 0.07852  0.2802  
 Residual             0.05500  0.2345  
Number of obs: 149, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  4.23116    0.41457 70.03009  10.206 1.69e-15 ***
SR           0.14471    0.05716 77.68545   2.532   0.0134 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.972

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.1364
  ICC (Speaker): 0.5079


For *VIE*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 47.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0095 -0.5369 -0.0324  0.5574  2.8577 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.005367 0.07326 
 Speaker  (Intercept) 0.100293 0.31669 
 Residual             0.059273 0.24346 
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  2.39971    0.43084 71.69691   5.570 4.19e-07 ***
SR           0.50664    0.07884 84.71528   6.426 7.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.971

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.0325
  ICC (Speaker): 0.6081


For *YUE*

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)
   Data: d

REML criterion at convergence: 3.5

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.11616 -0.57419 -0.04198  0.62276  2.30008 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.01674  0.1294  
 Speaker  (Intercept) 0.03228  0.1797  
 Residual             0.04201  0.2050  
Number of obs: 150, groups:  Text, 15; Speaker, 10

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)   2.3392     0.3532 144.9789   6.623 6.41e-10 ***
SR            0.4951     0.0622 147.6043   7.960 4.20e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
   (Intr)
SR -0.981

Intraclass Correlation Coefficient for Linear mixed model

Family : gaussian (identity)
Formula: SR.auto ~ SR + (1 | Text) + (1 | Speaker)

     ICC (Text): 0.1839
  ICC (Speaker): 0.3546
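
The ICC values printed after each model are consistent with dividing each random-intercept variance by the total variance (the random-effect variances plus the residual variance). The following is a minimal sketch in base R that checks this against the variance components printed for YUE above; it verifies the printed numbers and is not necessarily how the ICC function used in this script computes them internally.

```r
# Variance components copied from the "Random effects" block of the YUE model
vc <- c(Text = 0.01674, Speaker = 0.03228, Residual = 0.04201)

# ICC per grouping factor: its variance as a share of the total variance
icc <- vc[c("Text", "Speaker")] / sum(vc)

print(round(icc, 4))
#>    Text Speaker
#>  0.1839  0.3546
```

The same arithmetic applied to the TUR and VIE variance components reproduces their printed ICCs to four decimals.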
Correlations (Pearson’s and Spearman’s), paired t-test, and linear mixed-effects regression intercept and slope between automatic and canonical SR, across all languages (top row) and for each language separately. The regressions have Text and Speaker (and, for the top row only, Language) as random effects.

| Language | Pearson’s r | Spearman’s rho | Paired t-test | Intercept | Slope |
|---|---|---|---|---|---|
| all | r=0.49 (p=8.445e-141) | rho=0.47 (p=1.327e-125) | t(2287.0)=64.23 (p=0) | 3.66 (p=3.72e-81) | 0.24 (p=3.1e-50) |
| CAT | r=0.21 (p=0.009915) | rho=0.12 (p=0.1506) | t(149.0)=32.05 (p=9.71e-69) | 3.61 (p=3.74e-12) | 0.28 (p=4.38e-05) |
| CMN | r=0.61 (p=1.557e-16) | rho=0.58 (p=4.102e-15) | t(149.0)=17.88 (p=6.52e-39) | 3.31 (p=2.43e-10) | 0.31 (p=9.36e-05) |
| DEU | r=0.54 (p=6.05e-07) | rho=0.58 (p=3.847e-08) | t(74.0)=13.19 (p=4.09e-21) | 3.28 (p=1.05e-07) | 0.25 (p=0.000854) |
| ENG | r=0.60 (p=5.205e-07) | rho=0.60 (p=3.885e-07) | t(59.0)=20.08 (p=4.63e-28) | 2.84 (p=3.35e-05) | 0.33 (p=0.000252) |
| EUS | r=0.58 (p=5.564e-15) | rho=0.57 (p=1.539e-14) | t(149.0)=34.65 (p=3.43e-73) | 3.45 (p=4.54e-10) | 0.30 (p=7.31e-06) |
| FIN | r=0.16 (p=0.0509) | rho=0.12 (p=0.1481) | t(149.0)=32.16 (p=6.37e-69) | 4.13 (p=3.3e-11) | 0.17 (p=0.0279) |
| FRA | r=0.41 (p=2.346e-07) | rho=0.43 (p=3.183e-08) | t(149.0)=24.94 (p=4.79e-55) | 3.85 (p=5.7e-13) | 0.26 (p=0.000144) |
| HUN | r=0.73 (p=5.921e-26) | rho=0.76 (p=8.644e-30) | t(149.0)=16.80 (p=3.31e-36) | 2.81 (p=3.28e-09) | 0.41 (p=6.36e-08) |
| ITA | r=0.92 (p=2.878e-22) | rho=0.93 (p=0) | t(53.0)=24.40 (p=1.7e-30) | 2.26 (p=2.34e-05) | 0.43 (p=4.2e-07) |
| JPN | r=-0.07 (p=0.3829) | rho=-0.05 (p=0.5309) | t(149.0)=55.12 (p=5.28e-101) | 3.59 (p=3.6e-07) | 0.18 (p=0.0259) |
| KOR | r=0.43 (p=3.817e-08) | rho=0.48 (p=3.552e-10) | t(149.0)=30.44 (p=7.77e-66) | 2.92 (p=5.25e-09) | 0.34 (p=2.53e-07) |
| SPA | r=0.10 (p=0.2028) | rho=0.16 (p=0.05311) | t(149.0)=36.69 (p=1.66e-76) | 4.28 (p=1.88e-12) | 0.20 (p=0.00498) |
| SRP | r=0.51 (p=2.448e-11) | rho=0.47 (p=1.443e-09) | t(149.0)=36.54 (p=2.91e-76) | 3.70 (p=7.63e-11) | 0.22 (p=0.00128) |
| THA | r=0.67 (p=1.331e-20) | rho=0.64 (p=0) | t(149.0)=-6.94 (p=1.1e-10) | 3.17 (p=5.61e-14) | 0.37 (p=2.07e-06) |
| TUR | r=0.16 (p=0.0467) | rho=0.16 (p=0.05116) | t(148.0)=25.12 (p=3.01e-55) | 4.23 (p=1.69e-15) | 0.14 (p=0.0134) |
| VIE | r=0.67 (p=3.979e-21) | rho=0.61 (p=8.736e-17) | t(149.0)=5.10 (p=1.03e-06) | 2.40 (p=4.19e-07) | 0.51 (p=7.33e-09) |
| YUE | r=0.45 (p=5.825e-09) | rho=0.43 (p=4.186e-08) | t(149.0)=15.22 (p=3.85e-32) | 2.34 (p=6.41e-10) | 0.50 (p=4.2e-13) |
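
The quantities in one per-language row of this table could be computed along the following lines. This is a minimal sketch, not the script's actual code. The data frame `d` is simulated purely for illustration; only the column names `SR`, `SR.auto`, `Text` and `Speaker` are taken from the model formula shown in the outputs above. The order of the two arguments in the paired t-test is an assumption, and the mixed model is fitted only if the lmerTest package is installed.

```r
set.seed(1)

# Simulated stand-in for one language (10 speakers x 15 texts = 150 readings).
# All values are invented; only the column names follow the model formula.
d <- expand.grid(Speaker = factor(1:10), Text = factor(1:15))
spk <- rnorm(10, sd = 0.3)   # hypothetical speaker-specific offsets
txt <- rnorm(15, sd = 0.1)   # hypothetical text-specific offsets
d$SR      <- rnorm(nrow(d), mean = 6, sd = 0.5)           # canonical SR
d$SR.auto <- 2.5 + 0.5 * d$SR +                           # automatic SR
  spk[as.integer(d$Speaker)] + txt[as.integer(d$Text)] +
  rnorm(nrow(d), sd = 0.2)

# Pearson's and Spearman's correlations between automatic and canonical SR
r_pearson  <- cor.test(d$SR.auto, d$SR, method = "pearson")
r_spearman <- cor.test(d$SR.auto, d$SR, method = "spearman", exact = FALSE)

# Paired t-test on the two measures (argument order assumed: canonical first)
tt <- t.test(d$SR, d$SR.auto, paired = TRUE)

# Mixed-effects regression with random intercepts for Text and Speaker
if (requireNamespace("lmerTest", quietly = TRUE)) {
  m <- lmerTest::lmer(SR.auto ~ SR + (1 | Text) + (1 | Speaker), data = d)
  print(coef(summary(m)))  # intercept and slope with Satterthwaite t-tests
}
```

With 150 paired observations the t-test has 149 degrees of freedom, which matches the `t(149.0)` entries for the languages with complete data.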

Appendix III: Figures for the main paper

Here we generate the figures used in the main paper (saved to the ./figures folder as 600 DPI TIFF files Figure-*.tiff).